FACE DETECTION
This invention relates to face detection. 

Many human-face detection algorithms have been proposed in the literature,
including the use of so-called eigenfaces, face template matching, deformable template
matching or neural network classification. None of these is perfect, and each generally has
associated advantages and disadvantages. None gives an absolutely reliable indication that
an image contains a face; on the contrary, they are all based upon a probabilistic assessment,
based on a mathematical analysis of the image, of whether the image has at least a certain
likelihood of containing a face. Depending on their application, the algorithms generally
have the threshold likelihood value set quite high, to try to avoid false detections of faces.

In any sort of block-based analysis of a possible face, or an analysis involving a
comparison between the possible face and some pre-derived data indicative of the presence
of a face, there is a possibility that the algorithm will be confused by an image region which,
while possibly looking nothing like a face, may possess certain image attributes to pass the
comparison test. Such a region may then be assigned a high probability of containing a face,
and can lead to a false-positive face detection.

It is a constant aim in this technical field to improve the reliability of face detection,
including reducing the occurrence of false-positive detections. 

This invention provides video face detection apparatus in which a test image from a 
video sequence is compared with an image property model derived from image properties of 
a region detected to contain a face in a preceding image in the video sequence; the apparatus 
comprising: 

means for selecting a predetermined proportion of pixels in the region detected to
contain a face in the preceding image which most closely match the image property model 
derived in respect of that region, thereby deriving a pixel mask; and 

means for comparing pixels in the test image defined by the pixel mask with the 
image property model, the mask being applied at more than one image position within the 
test image; a face being detected in the test image at a mask position corresponding to a 
lowest average difference between the image property model and pixels defined by the mask 
at that position. 

The invention provides for the use of the most appropriate portion of pixels, being 
that portion which most closely matches the image property model, in a face detection 
process. This can give a more reliable result.

It will be appreciated that the term "preceding image" and the like refer to an order of
testing of the images, not necessarily to a forward temporal order of the video sequence.
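
By way of illustration only, the mask derivation and mask-based comparison set out above may be
sketched as follows. The sketch assumes that the image property model is a single scalar value
per pixel and that the match measure is an absolute difference; the function names
`derive_mask` and `best_mask_position` are hypothetical and do not appear in this description.

```python
def derive_mask(region, model_value, proportion):
    """Return (row, col) offsets of the pixels in `region` that most closely
    match `model_value`, keeping the given proportion of all pixels."""
    scored = [
        (abs(pixel - model_value), (r, c))
        for r, row in enumerate(region)
        for c, pixel in enumerate(row)
    ]
    scored.sort(key=lambda item: item[0])  # closest matches first
    keep = max(1, int(len(scored) * proportion))
    return [offset for _, offset in scored[:keep]]


def best_mask_position(image, mask, model_value, positions):
    """Apply the mask at each candidate (top, left) position and return the
    position giving the lowest average difference from the model."""
    def average_difference(top, left):
        diffs = [abs(image[top + r][left + c] - model_value) for r, c in mask]
        return sum(diffs) / len(diffs)

    return min(positions, key=lambda pos: average_difference(*pos))
```

Only the pixels retained in the mask contribute to the average, so pixels in the earlier face
region that matched the model poorly do not influence the position chosen in the test image.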

Various respective aspects and features of the invention are defined in the appended
claims.

Embodiments of the invention will now be described, by way of example only, with
reference to the accompanying drawings, throughout which like parts are defined by like
numerals, and in which:

Figure 1 is a schematic diagram of a general purpose computer system for use as a
face detection system and/or a non-linear editing system;
Figure 2 is a schematic diagram of a video camera-recorder (camcorder) using face
detection;
Figure 3 is a schematic diagram illustrating a training process;
Figure 4 is a schematic diagram illustrating a detection process;
Figure 5 schematically illustrates a feature histogram;
Figure 6 schematically illustrates a sampling process to generate eigenblocks;
Figures 7 and 8 schematically illustrate sets of eigenblocks;
Figure 9 schematically illustrates a process to build a histogram representing a block
position;
Figure 10 schematically illustrates the generation of a histogram bin number;
Figure 11 schematically illustrates the calculation of a face probability;
Figures 12a to 12f are schematic examples of histograms generated using the above
methods;
Figures 13a to 13g schematically illustrate so-called multiscale face detection;
Figure 14 schematically illustrates a face tracking algorithm;
Figures 15a and 15b schematically illustrate the derivation of a search area used for
skin colour detection;
Figure 16 schematically illustrates a mask applied to skin colour detection;
Figures 17a to 17c schematically illustrate the use of the mask of Figure 16;
Figure 18 is a schematic distance map;
Figures 19a to 19c schematically illustrate the use of face tracking when applied to a
video scene;
Figure 20 schematically illustrates a display screen of a non-linear editing system;
Figures 21a and 21b schematically illustrate clip icons; and
Figures 22a to 22c schematically illustrate a gradient pre-processing technique.

Figure 1 is a schematic diagram of a general purpose computer system for use as a
face detection system and/or a non-linear editing system. The computer system comprises a
processing unit 10 having (amongst other conventional components) a central processing
unit (CPU) 20, memory such as a random access memory (RAM) 30 and non-volatile
storage such as a disc drive 40. The computer system may be connected to a network 50
such as a local area network or the Internet (or both). A keyboard 60, mouse or other user
input device 70 and display screen 80 are also provided. The skilled man will appreciate that
a general purpose computer system may include many other conventional parts which need
not be described here.

Figure 2 is a schematic diagram of a video camera-recorder (camcorder) using face 
detection. The camcorder 100 comprises a lens 110 which focuses an image onto a charge 
coupled device (CCD) image capture device 120. The resulting image in electronic form is
processed by image processing logic 130 for recording on a recording medium such as a tape 
cassette 140. The images captured by the device 120 are also displayed on a user display 
150 which may be viewed through an eyepiece 160. 

To capture sounds associated with the images, one or more microphones are used.
These may be external microphones, in the sense that they are connected to the camcorder by
a flexible cable, or may be mounted on the camcorder body itself. Analogue audio signals
from the microphone(s) are processed by an audio processing arrangement 170 to produce
appropriate audio signals for recording on the storage medium 140.

It is noted that the video and audio signals may be recorded on the storage medium 
140 in either digital form or analogue form, or even in both forms. Thus, the image 
processing arrangement 130 and the audio processing arrangement 170 may include a stage 
of analogue to digital conversion. 

The camcorder user is able to control aspects of the lens 110's performance by user
controls 180 which influence a lens control arrangement 190 to send electrical control
signals 200 to the lens 110. Typically, attributes such as focus and zoom are controlled in
this way, but the lens aperture or other attributes may also be controlled by the user.

Two further user controls are schematically illustrated. A push button 210 is 
provided to initiate and stop recording onto the recording medium 140. For example, one
push of the control 210 may start recording and another push may stop recording, or the
control may need to be held in a pushed state for recording to take place, or one push may
start recording for a certain timed period, for example five seconds. In any of these
arrangements, it is technologically very straightforward to establish from the camcorder's
record operation where the beginning and end of each "shot" (continuous period of
recording) occurs.

The other user control shown schematically in Figure 2 is a "good shot marker"
(GSM) 220, which may be operated by the user to cause "metadata" (associated data) to be
stored in connection with the video and audio material on the recording medium 140,
indicating that this particular shot was subjectively considered by the operator to be "good"
in some respect (for example, the actors performed particularly well; the news reporter
pronounced each word correctly; and so on).

The metadata may be recorded in some spare capacity (e.g. "user data") on the
recording medium 140, depending on the particular format and standard in use.
Alternatively, the metadata can be stored on a separate storage medium such as a removable
Memory Stick™ memory (not shown), or the metadata could be stored on an external
database (not shown), for example being communicated to such a database by a wireless link
(not shown). The metadata can include not only the GSM information but also shot
boundaries, lens attributes, alphanumeric information input by a user (e.g. on a keyboard -
not shown), geographical position information from a global positioning system receiver (not
shown) and so on.

So far, the description has covered a metadata-enabled camcorder. Now, the way in
which face detection may be applied to such a camcorder will be described.

The camcorder includes a face detector arrangement 230. Appropriate arrangements
will be described in much greater detail below, but for this part of the description it is
sufficient to say that the face detector arrangement 230 receives images from the image
processing arrangement 130 and detects, or attempts to detect, whether such images contain
one or more faces. The face detector may output face detection data which could be in the
form of a "yes/no" flag or may be more detailed in that the data could include the image co-
ordinates of the faces, such as the co-ordinates of eye positions within each detected face.
This information may be treated as another type of metadata and stored in any of the other
formats described above.

As described below, face detection may be assisted by using other types of metadata
within the detection process. For example, the face detector 230 receives a control signal
from the lens control arrangement 190 to indicate the current focus and zoom settings of the
lens 110. These can assist the face detector by giving an initial indication of the expected
image size of any faces that may be present in the foreground of the image. In this regard, it
is noted that the focus and zoom settings between them define the expected separation
between the camcorder 100 and a person being filmed, and also the magnification of the lens
110. From these two attributes, based upon an average face size, it is possible to calculate
the expected size (in pixels) of a face in the resulting image data.
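
One way in which such a calculation might be carried out is sketched below. The sketch assumes
a simple pinhole (thin-lens) projection, which is not stated in this description; the function
name, the parameter names and the default average face width of 160 mm are likewise assumptions
made purely for illustration.

```python
def expected_face_width_pixels(focal_length_mm, subject_distance_mm,
                               pixel_pitch_mm, face_width_mm=160.0):
    """Estimate the width in pixels of an average face under a pinhole model.

    focal_length_mm     -- assumed to follow from the zoom setting
    subject_distance_mm -- assumed to follow from the focus setting
    pixel_pitch_mm      -- assumed spacing of pixels on the image sensor
    face_width_mm       -- assumed average face width (illustrative value)
    """
    # Size of the face as projected onto the sensor, by similar triangles.
    image_width_mm = focal_length_mm * face_width_mm / subject_distance_mm
    return image_width_mm / pixel_pitch_mm
```

For instance, under these assumptions a 50 mm focal length, a subject 2 m away and a pixel
pitch of 0.01 mm give a projected face about 4 mm wide, or roughly 400 pixels.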

A conventional (known) speech detector 240 receives audio information from the
audio processing arrangement 170 and detects the presence of speech in such audio
information. The presence of speech may be an indicator that the likelihood of a face being
present in the corresponding images is higher than if no speech is detected.

Finally, the GSM information 220 and shot information (from the control 210) are
supplied to the face detector 230, to indicate shot boundaries and those shots considered to
be most useful by the user.

Of course, if the camcorder is based upon the analogue recording technique, further
analogue to digital converters (ADCs) may be required to handle the image and audio 
information. 

The present embodiment uses a face detection technique arranged as two phases.
Figure 3 is a schematic diagram illustrating a training phase, and Figure 4 is a schematic 
diagram illustrating a detection phase. 

Unlike some previously proposed face detection methods (see References 4 and 5 
below), the present method is based on modelling the face in parts instead of as a whole. The
parts can either be blocks centred over the assumed positions of the facial features (so-called
"selective sampling") or blocks sampled at regular intervals over the face (so-called "regular
sampling"). The present description will cover primarily regular sampling, as this was found
in empirical tests to give the better results.

In the training phase, an analysis process is applied to a set of images known to 
contain faces, and (optionally) another set of images ("nonface images") known not to
contain faces. The analysis process builds a mathematical model of facial and nonfacial 
features, against which a test image can later be compared (in the detection phase). 

So, to build the mathematical model (the training process 310 of Figure 3), the basic
steps are as follows:

1. From a set 300 of face images normalised to have the same eye positions, each face is
sampled regularly into small blocks. 

2. Attributes are calculated for each block; these attributes are explained further below. 

3. The attributes are quantised to a manageable number of different values. 

4. The quantised attributes are then combined to generate a single quantised value in 
respect of that block position.

5. The single quantised value is then recorded as an entry in a histogram, such as the
schematic histogram of Figure 5. The collective histogram information 320 in respect of all
of the block positions in all of the training images forms the foundation of the mathematical
model of the facial features.

One such histogram is prepared for each possible block position, by repeating the 
above steps in respect of a large number of test face images. The test data are described 
further in Appendix A below. So, in a system which uses an array of 8 x 8 blocks, 64 
histograms are prepared. In a later part of the processing, a test quantised attribute is
compared with the histogram data; the fact that a whole histogram is used to model the data 
means that no assumptions have to be made about whether it follows a parameterised 
distribution, e.g. Gaussian or otherwise. To save data storage space (if needed), histograms 
which are similar can be merged so that the same histogram can be reused for different block 
positions. 

In the detection phase, to apply the face detector to a test image 350, successive
windows in the test image are processed 340 as follows:

6. The window is sampled regularly as a series of blocks, and attributes in respect of 
each block are calculated and quantised as in stages 1-4 above. 

7. Corresponding "probabilities" for the quantised attribute values for each block
position are looked up from the corresponding histograms. That is to say, for each block
position, a respective quantised attribute is generated and is compared with a histogram 
previously generated in respect of that block position. The way in which the histograms give 
rise to "probability" data will be described below. 

8. All the probabilities obtained above are multiplied together to form a final probability
which is compared against a threshold in order to classify the window as "face" or
"nonface". It will be appreciated that the detection result of "face" or "nonface" is a
probability-based measure rather than an absolute detection. Sometimes, an image not
containing a face may be wrongly detected as "face", a so-called false positive. At other
times, an image containing a face may be wrongly detected as "nonface", a so-called false
negative. It is an aim of any face detection system to reduce the proportion of false positives
and the proportion of false negatives, but it is of course understood that to reduce these
proportions to zero is difficult, if not impossible, with current technology.

As mentioned above, in the training phase, a set of "nonface" images can be used to
generate a corresponding set of "nonface" histograms. Then, to achieve detection of a face,
the "probability" produced from the nonface histograms may be compared with a separate
threshold, so that the probability has to be under the threshold for the test window to contain
a face. Alternatively, the ratio of the face probability to the nonface probability could be
compared with a threshold.

Extra training data may be generated by applying "synthetic variations" 330 to the 
original training set, such as variations in position, orientation, size, aspect ratio, background
scenery, lighting intensity and frequency content. 

The derivation of attributes and their quantisation will now be described. In the
present technique, attributes are measured with respect to so-called eigenblocks, which are
core blocks (or eigenvectors) representing different types of block which may be present in
the windowed image. The generation of eigenblocks will first be described with reference to
Figure 6.

Eigenblock creation 

The attributes in the present embodiment are based on so-called eigenblocks. The
eigenblocks were designed to have good representational ability of the blocks in the training
set. Therefore, they were created by performing principal component analysis on a large set
of blocks from the training set. This process is shown schematically in Figure 6 and
described in more detail in Appendix B.

Training the System

Experiments were performed with two different sets of training blocks. 

Eigenblock set I

Initially, a set of blocks taken from 25 face images in the training set was used.
The 16x16 blocks were sampled every 16 pixels and so were non-overlapping. This
sampling is shown in Figure 6. As can be seen, 16 blocks are generated from each 64x64
training image. This leads to a total of 400 training blocks overall.

The first 10 eigenblocks generated from these training blocks are shown in Figure 7. 

Eigenblock set II

A second set of eigenblocks was generated from a much larger set of training blocks.
These blocks were taken from 500 face images in the training set. In this case, the 16x16
blocks were sampled every 8 pixels and so overlapped by 8 pixels. This generated 49 blocks
from each 64x64 training image and led to a total of 24,500 training blocks.

The first 12 eigenblocks generated from these training blocks are shown in Figure 8.

Empirical results show that eigenblock set II gives slightly better results than set I.
This is because it is calculated from a larger set of training blocks taken from face images,
and so is perceived to be better at representing the variations in faces. However, the
improvement in performance is not large.

Building the Histograms

A histogram was built for each sampled block position within the 64x64 face image.
The number of histograms depends on the block spacing. For example, for block spacing of
16 pixels, there are 16 possible block positions and thus 16 histograms are used.
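
The relationship between block spacing and the number of block positions can be checked with a
short sketch. The function name `block_positions` is hypothetical; the window size, block size
and spacings are those given in this description.

```python
def block_positions(window_size=64, block_size=16, spacing=16):
    """List the (row, column) origins of all blocks of `block_size` that fit
    inside a square window when sampled every `spacing` pixels."""
    last_origin = window_size - block_size
    return [
        (i, j)
        for i in range(0, last_origin + 1, spacing)
        for j in range(0, last_origin + 1, spacing)
    ]
```

With a spacing of 16 pixels this yields 4 x 4 = 16 positions, and with a spacing of 8 pixels it
yields 7 x 7 = 49 positions, in agreement with the block counts given for eigenblock sets I and
II.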

The process used to build a histogram representing a single block position is shown 
in Figure 9. The histograms are created using a large training set 400 of M face images. For
each face image, the process comprises:

• Extracting 410 the relevant block from a position (i,j) in the face image.

• Calculating the eigenblock-based attributes for the block, and determining the relevant
bin number 420 from these attributes.

• Incrementing the relevant bin number in the histogram 430.

This process is repeated for each of M images in the training set, to create a 
histogram that gives a good representation of the distribution of frequency of occurrence of 
the attributes. Ideally, M is very large, e.g. several thousand. This can more easily be 
achieved by using a training set made up of a set of original faces and several hundred 
synthetic variations of each original face. 
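
The histogram-building loop for a single block position may be sketched as follows. This is an
illustration only: images are assumed to be lists of pixel rows, the position (i,j) is assumed
to give the row and column of the block origin, and the `bin_number` argument is a hypothetical
callable standing in for the eigenblock-based bin number calculation.

```python
from collections import Counter


def build_histogram(face_images, position, block_size, bin_number):
    """Accumulate one histogram for a single block position over M images."""
    i, j = position
    histogram = Counter()
    for image in face_images:
        # Extract the block whose top-left corner is at (i, j).
        block = [row[j:j + block_size] for row in image[i:i + block_size]]
        # Increment the bin selected by the block's attributes.
        histogram[bin_number(block)] += 1
    return histogram
```

A `Counter` is used here so that bins which never occur take no storage; a fixed-length array
indexed by bin number would serve equally well.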

Generating the histogram bin number

A histogram bin number is generated from a given block using the following process,
as shown in Figure 10. The 16x16 block 440 is extracted from the 64x64 window or face
image. The block is projected onto the set 450 of A eigenblocks to generate a set of
"eigenblock weights". These eigenblock weights are the "attributes" used in this
implementation. They have a range of -1 to +1. This process is described in more detail
in Appendix B. Each weight is quantised into a fixed number of levels, L, to produce a set of
quantised attributes 470, $w_i$, $i = 1..A$. The quantised weights are combined into a single
value as follows:

\[ h = w_1 L^{A-1} + w_2 L^{A-2} + \dots + w_{A-1} L^{1} + w_A L^{0} \]

where the value generated, h, is the histogram bin number 480. Note that the total number of
bins in the histogram is given by $L^A$.
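
A minimal sketch of this calculation follows, assuming that the quantised weights are treated
as the digits of a base-L number, so that A weights with L levels each give L^A distinct bins.
The uniform quantiser and the function names `quantise` and `bin_number` are assumptions made
for illustration; the description does not specify how the quantisation levels are placed.

```python
def quantise(weight, levels):
    """Map a weight in the range -1 to +1 onto an integer level 0..levels-1.
    Uniform spacing of the levels is an assumption of this sketch."""
    index = int((weight + 1.0) / 2.0 * levels)
    return min(max(index, 0), levels - 1)  # clamp, so +1 maps to the top level


def bin_number(weights, levels):
    """Combine A quantised weights into one bin number in 0..levels**A - 1,
    with the first weight as the most significant base-`levels` digit."""
    h = 0
    for weight in weights:
        h = h * levels + quantise(weight, levels)
    return h
```

The same function would be applied to training blocks and to test blocks, so that a test block
addresses the bin that similar training blocks incremented.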

The bin "contents", i.e. the frequency of occurrence of the set of attributes giving rise 
to that bin number, may be considered to be a probability value if it is divided by the number
of training images M. However, because the probabilities are compared with a threshold,
there is in fact no need to divide through by M as this value would cancel out in the
calculations. So, in the following discussions, the bin "contents" will be referred to as
"probability values", and treated as though they are probability values, even though in a strict
sense they are in fact frequencies of occurrence. 

The above process is used both in the training phase and in the detection phase. 

Face Detection Phase

The face detection process involves sampling the test image with a moving 64x64
window and calculating a face probability at each window position.

The calculation of the face probability is shown in Figure 11. For each block position
in the window, the block's bin number 490 is calculated as described in the previous section.
Using the appropriate histogram 500 for the position of the block, each bin number is looked
up and the probability 510 of that bin number is determined. The sum 520 of the logs of
these probabilities is then calculated across all the blocks to generate a face probability
value, $P_{face}$ (otherwise referred to as a log likelihood value).
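
The look-up and summation for one window may be sketched as below. The data layout (a mapping
from block position to bin number, and a mapping from block position to histogram) and the
function name are assumptions. The small `floor` value substituted for empty bins is also an
assumption, introduced only so that the logarithm is defined; the description does not say how
empty bins are handled.

```python
import math


def window_log_likelihood(bin_numbers, histograms, floor=1e-6):
    """Sum the logs of the histogram values looked up for every block.

    bin_numbers -- {block_position: bin number of the test block there}
    histograms  -- {block_position: {bin number: frequency of occurrence}}
    floor       -- assumed stand-in value for bins with no training entries
    """
    total = 0.0
    for position, h in bin_numbers.items():
        count = histograms[position].get(h, 0)
        total += math.log(max(count, floor))
    return total
```

Summing logarithms is equivalent to multiplying the individual probabilities and avoids
numerical underflow when many small values are combined.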

This process generates a probability "map" for the entire test image. In other words, a
probability value is derived in respect of each possible window centre position across the
image. The combination of all of these probability values into a rectangular (or whatever)
shaped array is then considered to be a probability "map" corresponding to that image.

This map is then inverted, so that the process of finding a face involves finding 
minima in the inverted map. A so-called distance-based technique is used. This technique
can be summarised as follows: The map (pixel) position with the smallest value in the
inverted probability map is chosen. If this value is larger than a threshold (TD), no more
faces are chosen. This is the termination criterion. Otherwise, a face-sized block
corresponding to the chosen centre pixel position is blanked out (i.e. omitted from the
following calculations) and the candidate face position finding procedure is repeated on the
rest of the image until the termination criterion is reached. 
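
The distance-based selection just summarised may be sketched as follows. The sketch assumes the
inverted map is a list of rows of numbers and that the blanked-out block is a square of
`face_size` map positions centred on the chosen position; the function name `find_faces` is
hypothetical.

```python
def find_faces(inverted_map, threshold, face_size):
    """Repeatedly pick the smallest value in the inverted map, stopping when
    it exceeds `threshold`, and blank out a face-sized block around each pick."""
    rows, cols = len(inverted_map), len(inverted_map[0])
    blanked = [[False] * cols for _ in range(rows)]
    half = face_size // 2
    faces = []
    while True:
        candidates = [
            (inverted_map[r][c], r, c)
            for r in range(rows)
            for c in range(cols)
            if not blanked[r][c]
        ]
        if not candidates:
            break
        value, r0, c0 = min(candidates)
        if value > threshold:
            break  # termination criterion
        faces.append((r0, c0))
        # Omit the face-sized block from all following searches.
        for r in range(max(0, r0 - half), min(rows, r0 + half + 1)):
            for c in range(max(0, c0 - half), min(cols, c0 + half + 1)):
                blanked[r][c] = True
    return faces
```

Blanking out each accepted position prevents neighbouring map positions, which usually also
have low values, from being reported as separate faces.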

Nonface method

The nonface model comprises an additional set of histograms which represent the 
probability distribution of attributes in nonface images. The histograms are created in exactly
the same way as for the face model, except that the training images contain examples of 
nonfaces instead of faces. 

During detection, two log probability values are computed, one using the face model 
and one using the nonface model. These are then combined by simply subtracting the 
nonface probability from the face probability: 

\[ P_{combined} = P_{face} - P_{nonface} \]

$P_{combined}$ is then used instead of $P_{face}$ to produce the probability map (before
inversion).

Note that the reason that $P_{nonface}$ is subtracted from $P_{face}$ is because these are log
probability values.
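
The connection between this subtraction and the ratio test mentioned earlier can be written out
explicitly. The lower-case symbols $p_{face}$ and $p_{nonface}$ are notation introduced here
for the underlying (non-logarithmic) probabilities and do not appear elsewhere in this
description.

```latex
\[
P_{combined}
  = P_{face} - P_{nonface}
  = \log p_{face} - \log p_{nonface}
  = \log \frac{p_{face}}{p_{nonface}}
\]
```

Comparing the combined value with a threshold is therefore equivalent to comparing the ratio of
the face probability to the nonface probability with the exponential of that threshold.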



Histogram Examples

Figures 12a to 12f show some examples of histograms generated by the training 
process described above. 

Figures 12a, 12b and 12c are derived from a training set of face images, and Figures 
12d, 12e and 12f are derived from a training set of nonface images. In particular: 


                                               Face histograms    Nonface histograms
Whole histogram                                Figure 12a         Figure 12d
Zoomed onto the main peaks at about h=1500     Figure 12b         Figure 12e
A further zoom onto the region about h=1570    Figure 12c         Figure 12f


It can clearly be seen that the peaks are in different places in the face histogram and 
the nonface histograms. 



Multiscale face detection 

In order to detect faces of different sizes in the test image, the test image is scaled by 
a range of factors and a distance (i.e. probability) map is produced for each scale. In Figures 
13a to 13c the images and their corresponding distance maps are shown at three different
scales. The method gives the best response (highest probability, or minimum distance) for
the large (central) subject at the smallest scale (Fig 13a) and better responses for the smaller
subject (to the left of the main figure) at the larger scales. (A darker colour on the map
represents a lower value in the inverted map, or in other words a higher probability of there
being a face).

Candidate face positions are extracted across different scales by first finding
the position which gives the best response over all scales. That is to say, the highest
probability (lowest distance) is established amongst all of the probability maps at all of the
scales. This candidate position is the first to be labelled as a face. The window centred over
that face position is then blanked out from the probability map at each scale. The size of the
window blanked out is proportional to the scale of the probability map.

Examples of this scaled blanking-out process are shown in Figures 13a to 13c. In 
particular, the highest probability across all the maps is found at the left hand side of the 
largest scale map (Figure 13c). An area 530 corresponding to the presumed size of a face is 
blanked off in Figure 13c. Corresponding, but scaled, areas 532, 534 are blanked off in the 
smaller maps. 

Areas larger than the test window may be blanked off in the maps, to avoid 
overlapping detections. In particular, an area equal to the size of the test window surrounded 
by a border half as wide/long as the test window is appropriate to avoid such overlapping 
detections. 

Additional faces are detected by searching for the next best response and blanking 
out the corresponding windows successively. 
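
The search across scales and the scaled blanking-out may be sketched as follows. Several
details are assumptions of the sketch and not of this description: each distance map is keyed
by its scale factor, a position in the map at scale s is taken to correspond to the same
position multiplied by s/s' in the map at scale s', and blanked entries are marked with `None`.
The function names are hypothetical.

```python
def best_over_scales(distance_maps):
    """Return (scale, row, col) of the lowest distance over all maps,
    ignoring blanked entries, or None if nothing remains."""
    best = None
    for scale, dmap in distance_maps.items():
        for r, row in enumerate(dmap):
            for c, value in enumerate(row):
                if value is not None and (best is None or value < best[0]):
                    best = (value, scale, r, c)
    return best[1:] if best else None


def blank_all_scales(distance_maps, scale, row, col, window):
    """Blank the window around a detection in every map, scaling both the
    centre and the window size in proportion to each map's scale."""
    for s, dmap in distance_maps.items():
        ratio = s / scale
        r0, c0 = round(row * ratio), round(col * ratio)
        half = max(1, round(window * ratio / 2))
        for r in range(max(0, r0 - half), min(len(dmap), r0 + half + 1)):
            for c in range(max(0, c0 - half), min(len(dmap[0]), c0 + half + 1)):
                dmap[r][c] = None
```

Calling `best_over_scales` and `blank_all_scales` alternately, until the best remaining
distance exceeds the chosen threshold, gives the successive detections described above.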

The intervals allowed between the scales processed are influenced by the sensitivity 
of the method to variations in size. It was found in this preliminary study of scale invariance 
that the method is not excessively sensitive to variations in size as faces which gave a good 
response at a certain scale often gave a good response at adjacent scales as well. 

The above description refers to detecting a face even though the size of the face in the 
image is not known at the start of the detection process. Another aspect of multiple scale 
face detection is the use of two or more parallel detections at different scales to validate the 
detection process. This can have advantages if, for example, the face to be detected is
partially obscured, or the person is wearing a hat etc. 

Figures 13d to 13g schematically illustrate this process. During the training phase, 
the system is trained on windows (divided into respective blocks as described above) which 
surround the whole of the test face (Figure 13d) to generate "full face" histogram data and
also on windows at an expanded scale so that only a central area of the test face is included
(Figure 13e) to generate "zoomed in" histogram data. This generates two sets of histogram
data. One set relates to the "full face" windows of Figure 13d, and the other relates to the
"central face area" windows of Figure 13e.

During the detection phase, for any given test window 536, the window is applied to 
two different scalings of the test image so that in one (Figure 13f) the test window surrounds 
the whole of the expected size of a face, and in the other (Figure 13g) the test window 
encompasses the central area of a face at that expected size. These are each processed as 
described above, being compared with the respective sets of histogram data appropriate to 
the type of window. The log probabilities from each parallel process are added before the 
comparison with a threshold is applied. 

Putting both of these aspects of multiple scale face detection together leads to a 
particularly elegant saving in the amount of data that needs to be stored. 

In particular, in these embodiments the multiple scales for the arrangements of 
Figures 13a to 13c are arranged in a geometric sequence. In the present example, each scale 
in the sequence is a factor of $\sqrt[4]{2}$ different to the adjacent scale in the sequence. Then, for
the parallel detection described with reference to Figures 13d to 13g, the larger scale, central
area, detection is carried out at a scale 3 steps higher in the sequence, that is, $2^{3/4}$ times larger
than the "full face" scale, using attribute data relating to the scale 3 steps higher in the 
sequence. So, apart from at extremes of the range of multiple scales, the geometric 
progression means that the parallel detection of Figures 13d to 13g can always be carried out 
using attribute data generated in respect of another multiple scale three steps higher in the 
sequence. 

The two processes (multiple scale detection and parallel scale detection) can be 
combined in various ways. For example, the multiple scale detection process of Figures 13a 
to 13c can be applied first, and then the parallel scale detection process of Figures 13d to 13g 
can be applied at areas (and scales) identified during the multiple scale detection process. 
However, a convenient and efficient use of the attribute data may be achieved by: 

• deriving attributes in respect of the test window at each scale (as in Figures 13a to 13c) 

• comparing those attributes with the "full face" histogram data to generate a "full face" set 
of distance maps 

• comparing the attributes with the "zoomed in" histogram data to generate a "zoomed in"
set of distance maps

• for each scale n, combining the "full face" distance map for scale n with the "zoomed in"
distance map for scale n+3

• deriving face positions from the combined distance maps as described above with
reference to Figures 13a to 13c
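
The pairing of scale n with scale n+3 may be sketched as follows for the values found at one
candidate face position. The sketch makes several assumptions: the step between adjacent scales
is taken to be the fourth root of two (so that three steps give a factor of 2^(3/4)), the two
results are combined by addition, and any resampling needed to bring maps of different scales
into register is left out. All names are hypothetical.

```python
SCALE_STEP = 2 ** 0.25  # assumed ratio between adjacent scales


def scale_factor(n):
    """Scale of step n in the geometric sequence, relative to step 0."""
    return SCALE_STEP ** n


def combined_scores(full_face, zoomed_in, offset=3):
    """Add the "full face" value at scale n to the "zoomed in" value at scale
    n + offset, for every n that has a partner scale in the sequence."""
    usable = min(len(full_face), len(zoomed_in) - offset)
    return {n: full_face[n] + zoomed_in[n + offset] for n in range(usable)}
```

Because both sets of values are derived from the same attribute data at each scale, no extra
attribute calculation is needed for the "zoomed in" test; only the scales at the upper extreme
of the range lack a partner.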

Further parallel testing can be performed to detect different poses, such as looking
straight ahead, looking partly up, down, left, right etc. Here a respective set of histogram
data is required and the results are preferably combined using a "max" function, that is, the
pose giving the highest probability is carried forward to thresholding, the others being
discarded.
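
The "max" combination may be sketched in a few lines, assuming the per-pose results are held in
a mapping from a pose label to its probability value; the labels and the function name are
hypothetical.

```python
def best_pose(pose_probabilities):
    """Return the (pose, probability) pair with the highest probability;
    the results for all other poses are discarded."""
    pose = max(pose_probabilities, key=pose_probabilities.get)
    return pose, pose_probabilities[pose]
```

Only the returned value is then compared with the detection threshold.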

Face Tracking 

A face tracking algorithm will now be described. The tracking algorithm aims to 
improve face detection performance in image sequences. 

The initial aim of the tracking algorithm is to detect every face in every frame of an 
image sequence. However, it is recognised that sometimes a face in the sequence may not be 
detected. In these circumstances, the tracking algorithm may assist in interpolating across
the missing face detections. 

Ultimately, the goal of face tracking is to be able to output some useful metadata 
from each set of frames belonging to the same scene in an image sequence. This might
include:

• Number of faces. 

• "Mugshot" (a colloquial word for an image of a person's face, derived from a term 
referring to a police file photograph) of each face. 

• Frame number at which each face first appears. 

• Frame number at which each face last appears. 

• Identity of each face (either matched to faces seen in previous scenes, or matched to a
face database) - this requires some face recognition also. 

The tracking algorithm uses the results of the face detection algorithm run 
independently on each frame of the image sequence, as its starting point. Because the face 
detection algorithm may sometimes miss (not detect) faces, some method of interpolating the
missing faces is useful. To this end, a Kalman filter was used to predict the next position of
the face and a skin colour matching algorithm was used to aid tracking of faces. In addition,
because the face detection algorithm often gives rise to false acceptances, some method of
rejecting these is also useful.

The algorithm is shown schematically in Figure 14. 

The algorithm will be described in detail below, but in summary, input video data
545 (representing the image sequence) is supplied to a face detector 540 of the type described in
this application, and a skin colour matching detector 550. The face detector attempts to
detect one or more faces in each image. When a face is detected, a Kalman filter 560 is
established to track the position of that face. The Kalman filter generates a predicted
position for the same face in the next image in the sequence. An eye position comparator
570, 580 detects whether the face detector 540 detects a face at that position (or within a
certain threshold distance of that position) in the next image. If this is found to be the case,
then that detected face position is used to update the Kalman filter and the process continues.

If a face is not detected at or near the predicted position, then a skin colour matching 
method 550 is used. This is a less precise face detection technique which is set up to have a 
lower threshold of acceptance than the face detector 540, so that it is possible for the skin
colour matching technique to detect (what it considers to be) a face even when the face
detector cannot make a positive detection at that position. If a "face" is detected by skin
colour matching, its position is passed to the Kalman filter as an updated position and the 
process continues. 

If no match is found by either the face detector 540 or the skin colour detector 550,
then the predicted position is used to update the Kalman filter.
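
The three-way fallback described above (face detection, then skin colour matching, then the prediction itself) might be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the use of `None` for "no result" and the source labels are hypothetical, and the matching and similarity tests are assumed to have been applied before this function is called.

```python
def choose_observation(predicted, detection, colour_match):
    """Pick the observation for one tracked face in the current frame.

    predicted    -- eye positions predicted by the Kalman filter
    detection    -- eye positions from the face detector near the
                    prediction, or None if no face was detected there
    colour_match -- eye positions from skin colour matching, or None if
                    the match failed its similarity criterion
    Returns (observation, source); the source label can be stored so that
    the acceptance criteria can be applied later.
    """
    if detection is not None:
        return detection, "detection"
    if colour_match is not None:
        return colour_match, "colour"
    return predicted, "prediction"
```

Recording the source of each observation is what later allows a track built mostly from predictions or colour matches to be rejected.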

All of these results are subject to acceptance criteria (see below). So, for example, a
face that is tracked throughout a sequence on the basis of one positive detection and the
remainder as predictions, or the remainder as skin colour detections, will be rejected.

A separate Kalman filter is used to track each face in the tracking algorithm.

In order to use a Kalman filter to track a face, a state model representing the face
must be created. In the model, the position of each face is represented by a 4-dimensional
vector containing the co-ordinates of the left and right eyes, which in turn are derived by a
predetermined relationship to the centre position of the window and the scale being used:



$$
p(k) = \begin{bmatrix} \text{FirstEyeX} \\ \text{FirstEyeY} \\ \text{SecondEyeX} \\ \text{SecondEyeY} \end{bmatrix}
$$

where k is the frame number.

The current state of the face is represented by its position, velocity and acceleration, 
in a 12-dimensional vector: 



$$
\hat{z}(k) = \begin{bmatrix} p(k) \\ \dot{p}(k) \\ \ddot{p}(k) \end{bmatrix}
$$



First Face Detected 

The tracking algorithm does nothing until it receives a frame with a face detection 
result indicating that there is a face present. 

A Kalman filter is then initialised for each detected face in this frame. Its state is 
initialised with the position of the face, and with zero velocity and acceleration: 

$$
\hat{z}_a(k) = \begin{bmatrix} p(k) \\ 0 \\ 0 \end{bmatrix}
$$

It is also assigned some other attributes: the state model error covariance, Q, and the
observation error covariance, R. The error covariance of the Kalman filter, P, is also
initialised. These parameters are described in more detail below. At the beginning of the
following frame, and every subsequent frame, a Kalman filter prediction process is carried
out.
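
As a minimal sketch of this initialisation, assuming NumPy and a 12-dimensional state laid out as position, velocity, acceleration: the function names and the scale values `q`, `r` and `p0` are placeholders chosen for illustration, not values given in this description.

```python
import numpy as np


def init_state(eye_positions):
    """Initial state: detected eye positions, zero velocity and acceleration."""
    p = np.asarray(eye_positions, dtype=float)  # (x1, y1, x2, y2)
    return np.concatenate([p, np.zeros(4), np.zeros(4)])


def init_filter(eye_positions, q=1.0, r=10.0, p0=1.0):
    """Create the per-face filter variables.

    q, r and p0 are hypothetical scale factors for Q, R and P.
    """
    return {
        "z": init_state(eye_positions),  # filter state
        "P": p0 * np.eye(12),            # filter error covariance
        "Q": q * np.eye(12),             # state model error covariance
        "R": r * np.eye(12),             # observation error covariance
    }
```

One such set of variables would be created for each face detected in the first frame that contains a detection.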



Kalman Filter Prediction Process

For each existing Kalman filter, the next position of the face is predicted using the
standard Kalman filter prediction equations shown below. The filter uses the previous state
(at frame k-1) and some other internal and external variables to estimate the current state of
the filter (at frame k).

State prediction equation:

$$
\hat{z}_b(k) = \Phi(k, k-1)\,\hat{z}_a(k-1)
$$

Covariance prediction equation:

$$
P_b(k) = \Phi(k, k-1)\,P_a(k-1)\,\Phi(k, k-1)^T + Q(k)
$$

where $\hat{z}_b(k)$ denotes the state before updating the filter for frame k, $\hat{z}_a(k-1)$ denotes the
state after updating the filter for frame k-1 (or the initialised state if it is a new filter), and
$\Phi(k, k-1)$ is the state transition matrix. Various state transition matrices were experimented
with, as described below. Similarly, $P_b(k)$ denotes the filter's error covariance before
updating the filter for frame k and $P_a(k-1)$ denotes the filter's error covariance after
updating the filter for the previous frame (or the initialised value if it is a new filter). $P_b(k)$
can be thought of as an internal variable in the filter that models its accuracy.

Q(k) is the error covariance of the state model. A high value of Q(k) means that the
predicted values of the filter's state (i.e. the face's position) will be assumed to have a high
level of error. By tuning this parameter, the behaviour of the filter can be changed and 
potentially improved for face detection. 
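
The prediction step amounts to one matrix-vector product for the state and one matrix product plus Q for the covariance. A minimal sketch, assuming NumPy; the function and argument names are illustrative only, with the suffixes `_a` and `_b` standing for "after" and "before" the update.

```python
import numpy as np


def kalman_predict(z_a_prev, P_a_prev, Phi, Q):
    """Standard Kalman prediction.

    z_a_prev, P_a_prev -- state and error covariance after the update for
                          frame k-1 (or the initialised values)
    Phi                -- state transition matrix
    Q                  -- state model error covariance
    Returns the state and covariance before the update for frame k.
    """
    z_b = Phi @ z_a_prev
    P_b = Phi @ P_a_prev @ Phi.T + Q
    return z_b, P_b
```

A larger Q inflates the predicted covariance, so the subsequent update places relatively more weight on the observation.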

State Transition Matrix 

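The state transition matrix, $\Phi(k, k-1)$, determines how the prediction of the next
state is made. Using the equations for motion, the following matrix can be derived for
$\Phi(k, k-1)$:

$$
\Phi(k, k-1) = \begin{bmatrix} I_4 & I_4\,\Delta t & \tfrac{1}{2} I_4\,(\Delta t)^2 \\ O_4 & I_4 & I_4\,\Delta t \\ O_4 & O_4 & I_4 \end{bmatrix}
$$

where $O_4$ is a 4x4 zero matrix and $I_4$ is a 4x4 identity matrix. $\Delta t$ can simply be set to 1 (i.e.
units of t are frame periods).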

This state transition matrix models position, velocity and acceleration. However, it 
was found that the use of acceleration tended to make the face predictions accelerate towards 
the edge of the picture when no face detections were available to correct the predicted state. 
Therefore, a simpler state transition matrix without using acceleration was preferred: 



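$$
\Phi(k, k-1) = \begin{bmatrix} I_4 & I_4\,\Delta t & O_4 \\ O_4 & I_4 & O_4 \\ O_4 & O_4 & O_4 \end{bmatrix}
$$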
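
The two transition matrices might be built as block matrices as sketched below, assuming NumPy. The function names are hypothetical. The simpler matrix is taken here to keep position and velocity and to zero the acceleration block; that layout is an assumption wherever the printed matrix is unclear.

```python
import numpy as np

I4 = np.eye(4)
O4 = np.zeros((4, 4))


def transition_with_acceleration(dt=1.0):
    """Position, velocity and acceleration model (equations of motion)."""
    return np.block([
        [I4, dt * I4, 0.5 * dt**2 * I4],
        [O4, I4,      dt * I4],
        [O4, O4,      I4],
    ])


def transition_without_acceleration(dt=1.0):
    """Simpler model: acceleration is ignored and set to zero (assumed layout)."""
    return np.block([
        [I4, dt * I4, O4],
        [O4, I4,      O4],
        [O4, O4,      O4],
    ])
```

With the simpler matrix, a face with no correcting observations drifts at constant velocity rather than accelerating towards the edge of the picture.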



The predicted eye positions of each Kalman filter, $\hat{z}_b(k)$, are compared to all face
detection results in the current frame (if there are any). If the distance between the eye
positions is below a given threshold, then the face detection can be assumed to belong to the
same face as that being modelled by the Kalman filter. The face detection result is then
treated as an observation, y(k), of the face's current state:

$$
y(k) = \begin{bmatrix} p(k) \\ 0 \\ 0 \end{bmatrix}
$$

where p(k) is the position of the eyes in the face detection result. This observation is used
during the Kalman filter update stage to help correct the prediction.
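
A minimal sketch of the matching test follows. The description specifies only that the distance between eye positions is compared with a threshold; the particular distance used here (the mean of the two per-eye Euclidean distances) and the function names are assumptions made for illustration.

```python
import math


def eye_distance(a, b):
    """Mean Euclidean distance between corresponding eyes of two faces.

    Each face is (first_eye_x, first_eye_y, second_eye_x, second_eye_y).
    """
    d1 = math.hypot(a[0] - b[0], a[1] - b[1])
    d2 = math.hypot(a[2] - b[2], a[3] - b[3])
    return 0.5 * (d1 + d2)


def match_detection(predicted, detections, threshold):
    """Return the closest detection within the threshold, else None."""
    best = min(detections, key=lambda d: eye_distance(predicted, d), default=None)
    if best is not None and eye_distance(predicted, best) < threshold:
        return best
    return None
```

A result of `None` corresponds to the case in which no face detection matches the tracked face, so that skin colour matching is attempted instead.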

Skin Colour Matching 

Skin colour matching is not used for faces that successfully match face detection
results. Skin colour matching is only performed for faces whose position has been predicted
by the Kalman filter but have no matching face detection result in the current frame, and
therefore no observation data to help update the Kalman filter.

In a first technique, for each face, an elliptical area centred on the face's previous 
position is extracted from the previous frame. An example of such an area 600 within the 
face window 610 is shown schematically in Figure 16. A colour model is seeded using the 
chrominance data from this area to produce an estimate of the mean and covariance of the Cr 
and Cb values, based on a Gaussian model. 
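
Seeding the colour model from an elliptical area might be sketched as follows, assuming NumPy and assuming that the Cr and Cb planes of the face window are supplied as two equally sized arrays. The ellipse is taken to be the one inscribed in the window, which is an illustrative choice; the function names are hypothetical.

```python
import numpy as np


def elliptical_mask(height, width):
    """Boolean mask of the ellipse inscribed in a height x width window."""
    y, x = np.ogrid[:height, :width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return ((y - cy) / (height / 2.0)) ** 2 + ((x - cx) / (width / 2.0)) ** 2 <= 1.0


def seed_colour_model(cr, cb):
    """Estimate the mean and covariance of (Cr, Cb) inside the ellipse."""
    mask = elliptical_mask(*cr.shape)
    samples = np.stack([cr[mask], cb[mask]], axis=1).astype(float)
    return samples.mean(axis=0), np.cov(samples, rowvar=False)
```

Pixels in the window corners fall outside the ellipse and so do not contribute to the model, which is the purpose of using an elliptical rather than a rectangular area.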

An area around the predicted face position in the current frame is then searched and 
the position that best matches the colour model, again averaged over an elliptical area, is 
selected. If the colour match meets a given similarity criterion, then this position is used as 
an observation, y(k), of the face's current state in the same way described for face detection
results in the previous section. 

Figures 15a and 15b schematically illustrate the generation of the search area. In 
particular, Figure 15a schematically illustrates the predicted position 620 of a face within the
next image 630. In skin colour matching, a search area 640 surrounding the predicted
position 620 in the next image is searched for the face.

If the colour match does not meet the similarity criterion, then no reliable observation 
data is available for the current frame. Instead, the predicted state, $\hat{z}_b(k)$, is used as the
observation:

$$
y(k) = \hat{z}_b(k)
$$

The skin colour matching methods described above use a simple Gaussian skin 
colour model. The model is seeded on an elliptical area centred on the face in the previous
frame, and used to find the best matching elliptical area in the current frame. However, to
provide a potentially better performance, two further methods were developed: a colour 
histogram method and a colour mask method. These will now be described. 

Colour Histogram Method

In this method, instead of using a Gaussian to model the distribution of colour in the 
tracked face, a colour histogram is used. 

For each tracked face in the previous frame, a histogram of Cr and Cb values within a 
square window around the face is computed. To do this, for each pixel the Cr and Cb values 
are first combined into a single value. A histogram is then computed that measures the
frequency of occurrence of these values in the whole window. Because the number of 
combined Cr and Cb values is large (256x256 possible combinations), the values are 
quantised before the histogram is calculated. 

Having calculated a histogram for a tracked face in the previous frame, the histogram 
is used in the current frame to try to estimate the most likely new position of the face by 
finding the area of the image with the most similar colour distribution. As shown 
schematically in Figures 15a and 15b, this is done by calculating a histogram in exactly the 
same way for a range of window positions within a search area of the current frame. This 
search area covers a given area around the predicted face position. The histograms are then 
compared by calculating the mean squared error (MSE) between the original histogram for 
the tracked face in the previous frame and each histogram in the current frame. The 
estimated position of the face in the current frame is given by the position of the minimum 
MSE. 
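
The histogram comparison might be sketched as below, assuming NumPy, 8-bit Cr and Cb planes and a two-channel histogram. The way the two values are combined into one (a quantised Cr index multiplied by the number of levels, plus the quantised Cb index) is one possible choice and is an assumption, as are the function names.

```python
import numpy as np


def colour_histogram(cr, cb, levels=8):
    """Quantised joint histogram of Cr and Cb for one window."""
    step = 256 // levels
    combined = (cr.astype(int) // step) * levels + (cb.astype(int) // step)
    return np.bincount(combined.ravel(), minlength=levels * levels).astype(float)


def best_position(ref_hist, cr, cb, size, search_positions, levels=8):
    """Return the (top, left) window position with the minimum MSE."""
    def mse_at(pos):
        top, left = pos
        hist = colour_histogram(cr[top:top + size, left:left + size],
                                cb[top:top + size, left:left + size], levels)
        return np.mean((hist - ref_hist) ** 2)
    return min(search_positions, key=mse_at)
```

Here `ref_hist` is the histogram computed for the tracked face in the previous frame and `search_positions` lists the candidate window positions within the search area of the current frame.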

Various modifications may be made to this algorithm, including: 

• Using three channels (Y, Cr and Cb) instead of two (Cr, Cb). 

• Varying the number of quantisation levels. 

• Dividing the window into blocks and calculating a histogram for each block. In this way, 
the colour histogram method becomes positionally dependent. The MSE between each 
pair of histograms is summed in this method. 

• Varying the number of blocks into which the window is divided. 

• Varying the blocks that are actually used - e.g. omitting the outer blocks which might
only partially contain face pixels.

For the test data used in empirical trials of these techniques, the best results were
achieved using the following conditions, although other sets of conditions may provide 
equally good or better results with different test data: 

• 3 channels (Y, Cr and Cb). 

• 8 quantisation levels for each channel (i.e. histogram contains 8x8x8 = 512 bins). 

• Dividing the windows into 16 blocks.

• Using all 16 blocks. 






Colour Mask Method 

This method is based on the method first described above. It uses a Gaussian skin
colour model to describe the distribution of pixels in the face.

In the method first described above, an elliptical area centred on the face is used to 
colour match faces, as this may be perceived to reduce or minimise the quantity of 
background pixels which might degrade the model. 

In the present colour mask model, a similar elliptical area is still used to seed a colour 
model on the original tracked face in the previous frame, for example by applying the mean 
and covariance of RGB or YCrCb to set parameters of a Gaussian model (or alternatively, a 
default colour model such as a Gaussian model can be used, see below). However, it is not
used when searching for the best match in the current frame. Instead, a mask area is
calculated based on the distribution of pixels in the original face window from the previous
frame. The mask is calculated by finding the 50% of pixels in the window which best match
the colour model. An example is shown in Figures 17a to 17c. In particular, Figure 17a
schematically illustrates the initial window under test; Figure 17b schematically illustrates 
the elliptical window used to seed the colour model; and Figure 17c schematically illustrates 
the mask defined by the 50% of pixels which most closely match the colour model. 

To estimate the position of the face in the current frame, a search area around the 
predicted face position is searched (as before) and the "distance" from the colour model is 
calculated for each pixel. The "distance" refers to a difference from the mean, normalised in 
each dimension by the variance in that dimension. An example of the resultant distance 
image is shown in Figure 18. For each position in this distance map (or for a reduced set of 
sampled positions to reduce computation time), the pixels of the distance image are averaged 
over a mask-shaped area. The position with the lowest averaged distance is then selected as 
the best estimate for the position of the face in this frame.

This method thus differs from the original method in that a mask-shaped area is used
in the distance image, instead of an elliptical area. This allows the colour match method to 
use both colour and shape information. 
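
The colour mask method might be sketched as follows, assuming NumPy. The "distance" is implemented here as the squared difference from the mean divided by the variance, summed over the two chrominance dimensions; that exact form, and the function names, are assumptions. If several pixels tie at the cut-off distance, the mask may include slightly more than the requested proportion.

```python
import numpy as np


def distance_image(cr, cb, mean, var):
    """Per-pixel distance from the colour model (assumed form: squared
    difference from the mean, normalised by the variance per dimension)."""
    return (cr - mean[0]) ** 2 / var[0] + (cb - mean[1]) ** 2 / var[1]


def colour_mask(dist_window, proportion=0.5):
    """Mask of the given proportion of pixels that best match the model."""
    k = max(1, int(round(proportion * dist_window.size)))
    cutoff = np.sort(dist_window, axis=None)[k - 1]
    return dist_window <= cutoff


def best_mask_position(dist, mask, positions):
    """Position (top, left) with the lowest distance averaged over the mask."""
    h, w = mask.shape
    return min(positions,
               key=lambda p: dist[p[0]:p[0] + h, p[1]:p[1] + w][mask].mean())
```

The mask would be computed once from the face window in the previous frame and then slid over the distance image of the current frame.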

Two variations are proposed and were implemented in empirical trials of the 
techniques: 

(a) A Gaussian skin colour model is seeded using the mean and covariance of Cr and Cb
from an elliptical area centred on the tracked face in the previous frame.

(b) A default Gaussian skin colour model is used, both to calculate the mask in the
previous frame and to calculate the distance image in the current frame.

The use of Gaussian skin colour models will now be described further. A Gaussian 
model for the skin colour class is built using the chrominance components of the YCbCr 
colour space. The similarity of test pixels to the skin colour class can then be measured. This 
method thus provides a skin colour likelihood estimate for each pixel, independently of the 
eigenface-based approaches. 

Let w be the vector of the CbCr values of a test pixel. The probability of w belonging 
to the skin colour class S is modelled by a two-dimensional Gaussian: 
$$
p(w \mid S) = \exp\left[-\tfrac{1}{2}\,(w - \mu_s)^T\,\Sigma_s^{-1}\,(w - \mu_s)\right]
$$

where the mean $\mu_s$ and the covariance matrix $\Sigma_s$ of the distribution are (previously)
estimated from a training set of skin colour values.
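
A minimal sketch of this likelihood, assuming NumPy: the function name is hypothetical, and the value returned is the unnormalised exponential term only, with no normalising constant.

```python
import numpy as np


def skin_likelihood(w, mean, cov):
    """Unnormalised Gaussian likelihood exp(-0.5 * d^T cov^-1 d), d = w - mean.

    w    -- (Cb, Cr) values of the test pixel
    mean -- mean of the skin colour class
    cov  -- 2x2 covariance matrix of the skin colour class
    """
    d = np.asarray(w, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))
```

A pixel whose chrominance equals the class mean scores 1.0, and the score falls towards 0 as the pixel moves away from the mean relative to the class covariance.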

Skin colour detection is not considered to be an effective face detector when used on its own. This is
because there can be many areas of an image that are similar to skin colour but are not necessarily faces, for
example other parts of the body. However, it can be used to improve the performance of the eigenblock-based
approaches by using a combined approach as described in respect of the present face tracking system.

The decisions made on whether to accept the face detected eye positions or the colour matched
eye positions as the observation for the Kalman filter, or whether no observation was
accepted, are stored. These are used later to assess the ongoing validity of the faces modelled
by each Kalman filter.

Kalman Filter Update Step 

The update step is used to determine an appropriate output of the filter for the current 
frame, based on the state prediction and the observation data. It also updates the internal 
variables of the filter based on the error between the predicted state and the observed state. 

The following equations are used in the update step:

Kalman gain equation:

$$
K(k) = P_b(k)\,H^T(k)\,\left[H(k)\,P_b(k)\,H^T(k) + R(k)\right]^{-1}
$$

State update equation:

$$
\hat{z}_a(k) = \hat{z}_b(k) + K(k)\,\left[y(k) - H(k)\,\hat{z}_b(k)\right]
$$

Covariance update equation:

$$
P_a(k) = P_b(k) - K(k)\,H(k)\,P_b(k)
$$

Here, K(k) denotes the Kalman gain, another variable internal to the Kalman filter. It
is used to determine how much the predicted state should be adjusted based on the observed
state, y(k).
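
The update step might be sketched as follows, assuming NumPy and the standard Kalman filter form. The function and argument names are illustrative; the suffixes `_b` and `_a` stand for "before" and "after" the update. It is assumed that R is chosen so that the bracketed matrix is invertible.

```python
import numpy as np


def kalman_update(z_b, P_b, y, H, R):
    """Standard Kalman update.

    z_b, P_b -- predicted state and error covariance (before the update)
    y        -- observation
    H        -- observation matrix
    R        -- observation error covariance
    Returns the updated state, updated covariance and the Kalman gain.
    """
    S = H @ P_b @ H.T + R              # innovation covariance
    K = P_b @ H.T @ np.linalg.inv(S)   # Kalman gain
    z_a = z_b + K @ (y - H @ z_b)      # state update
    P_a = P_b - K @ H @ P_b            # covariance update
    return z_a, P_a, K
```

In the scalar case with unit predicted covariance and unit observation covariance, the gain is 0.5 and the updated state lies midway between the prediction and the observation; a larger R reduces the gain, so the prediction is trusted more.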

H(k) is the observation matrix. It determines which parts of the state can be
observed. In our case, only the position of the face can be observed, not its velocity or
acceleration, so the following matrix is used for H(k):

$$
H(k) = \begin{bmatrix} I_4 & O_4 & O_4 \\ O_4 & O_4 & O_4 \\ O_4 & O_4 & O_4 \end{bmatrix}
$$

R(k) is the error covariance of the observation data. In a similar way to Q(k), a high
value of R(k) means that the observed values of the filter's state (i.e. the face detection
results or colour matches) will be assumed to have a high level of error. By tuning this
parameter, the behaviour of the filter can be changed and potentially improved for face
detection. For our experiments, a large value of R(k) relative to Q(k) was found to be
suitable (this means that the predicted face positions are treated as more reliable than the
observations). Note that it is permissible to vary these parameters from frame to frame.
Therefore, an interesting future area of investigation may be to adjust the relative values of
R(k) and Q(k) depending on whether the observation is based on a face detection result
(reliable) or a colour match (less reliable).

For each Kalman filter, the updated state, $\hat{z}_a(k)$, is used as the final decision on the
position of the face. This data is output to file and stored. 

Unmatched face detection results are treated as new faces. A new Kalman filter is
initialised for each of these. Faces are removed which:

• Leave the edge of the picture and/or

• Have a lack of ongoing evidence supporting them (when there is a high proportion of
observations based on Kalman filter predictions rather than face detection results or
colour matches).

For these faces, the associated Kalman filter is removed and no data is output to file.
As an optional difference from this approach, where a face is detected to leave the picture,
the tracking results up to the frame before it leaves the picture may be stored and treated as
valid face tracking results (providing that the results meet any other criteria applied to
validate tracking results).

These rules may be formalised and built upon by bringing in some additional
variables:

• prediction_acceptance_ratio_threshold: If, during tracking a given face, the proportion
of accepted Kalman predicted face positions exceeds this threshold, then the tracked face is
rejected. This is currently set to 0.8.

• detection_acceptance_ratio_threshold: During a final pass through all the frames, if for
a given face the proportion of accepted face detections falls below this threshold, then the
tracked face is rejected. This is currently set to 0.08.

• min_frames: During a final pass through all the frames, if for a given face the number of
occurrences is less than min_frames, the face is rejected. This is only likely to occur near
the end of a sequence. min_frames is currently set to 5.

• final_prediction_acceptance_ratio_threshold and min_frames2: During a final pass
through all the frames, if for a given tracked face the number of occurrences is less than
min_frames2 AND the proportion of accepted Kalman predicted face positions exceeds the
final_prediction_acceptance_ratio_threshold, the face is rejected. Again, this is only likely to
occur near the end of a sequence. final_prediction_acceptance_ratio_threshold is
currently set to 0.5 and min_frames2 is currently set to 10.

• min_eye_spacing: Additionally, faces are now removed if they are tracked such that the
eye spacing is decreased below a given minimum distance. This can happen if the Kalman
filter falsely believes the eye distance is becoming smaller and there is no other evidence,
e.g. face detection results, to correct this assumption. If uncorrected, the eye distance would
eventually become zero. As an optional alternative, a minimum or lower limit eye separation
can be forced, so that if the detected eye separation reduces to the minimum eye separation,
the detection process continues to search for faces having that eye separation,
but not a smaller eye separation.
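
The acceptance rules above might be gathered into a single test as sketched below. The `TrackStats` record and the function name are hypothetical, and all the rules are applied in one pass here purely for brevity, whereas the description applies the first during tracking and the others during a final pass. No value is given for min_eye_spacing, so the default of 4.0 pixels is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class TrackStats:
    detections: int     # frames with an accepted face detection
    colour: int         # frames with an accepted colour match
    predictions: int    # frames where the Kalman prediction was used
    eye_spacing: float  # current eye spacing in pixels

    @property
    def frames(self):
        return self.detections + self.colour + self.predictions


def accept_track(s, min_eye_spacing=4.0,
                 prediction_acceptance_ratio_threshold=0.8,
                 detection_acceptance_ratio_threshold=0.08,
                 min_frames=5,
                 final_prediction_acceptance_ratio_threshold=0.5,
                 min_frames2=10):
    """Return True if the tracked face passes every acceptance rule."""
    if s.frames == 0:
        return False
    prediction_ratio = s.predictions / s.frames
    detection_ratio = s.detections / s.frames
    if prediction_ratio > prediction_acceptance_ratio_threshold:
        return False
    if detection_ratio < detection_acceptance_ratio_threshold:
        return False
    if s.frames < min_frames:
        return False
    if (s.frames < min_frames2
            and prediction_ratio > final_prediction_acceptance_ratio_threshold):
        return False
    if s.eye_spacing < min_eye_spacing:
        return False
    return True
```

For example, a face seen in 65 frames with 50 detections passes, whereas a face supported by one detection and twenty predictions is rejected by the first rule.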



It is noted that the tracking process is not limited to tracking through a video
sequence in a forward temporal direction. Assuming that the image data remain accessible
(i.e. the process is not real-time, or the image data are buffered for temporary continued use),
the entire tracking process could be carried out in a reverse temporal direction. Or, when a
first face detection is made (often part-way through a video sequence) the tracking process
could be initiated in both temporal directions. As a further option, the tracking process could
be run in both temporal directions through a video sequence, with the results being combined
so that (for example) a tracked face meeting the acceptance criteria is included as a valid
result whichever direction the tracking took place.

In the tracking system shown schematically in Figure 14, three further features are
included.

Shot boundary data 560 (from metadata associated with the image sequence under
test; or metadata generated within the camera of Figure 2) defines the limits of each
contiguous "shot" within the image sequence. The Kalman filter is reset at shot boundaries,
and is not allowed to carry a prediction over to a subsequent shot, as the prediction would be
meaningless.

User metadata 542 and camera setting metadata 544 are supplied as inputs to the face
detector 540. These may also be used in a non-tracking system. Examples of the camera
setting metadata were described above. User metadata may include information such as:

• type of programme (e.g. news, interview, drama)

• script information such as specification of a "long shot", "medium close-up" etc
(particular types of camera shot leading to an expected sub-range of face sizes), how
many people involved in each shot (again leading to an expected sub-range of face sizes)
and so on

• sports-related information - sports are often filmed from fixed camera positions using
standard views and shots. By specifying these in the metadata, again a sub-range of face
sizes can be derived

The type of programme is relevant to the type of face which may be expected in the
images or image sequence. For example, in a news programme, one would expect to see a
single face for much of the image sequence, occupying an area of (say) 10% of the screen.
The detection of faces at different scales can be weighted in response to this data, so that
faces of about this size are given an enhanced probability. Another alternative or additional
approach is that the search range is reduced, so that instead of searching for faces at all
possible scales, only a subset of scales is searched. This can reduce the processing
requirements of the face detection process. In a software-based system, the software can run
more quickly and/or on a less powerful processor. In a hardware-based system (including
for example an application-specific integrated circuit (ASIC) or field programmable gate
array (FPGA) system) the hardware needs may be reduced.

The other types of user metadata mentioned above may also be applied in this way.
The "expected face size" sub-ranges may be stored in a look-up table held in the memory 30,
for example.

As regards camera metadata, for example the current focus and zoom settings of the
lens 110, these can also assist the face detector by giving an initial indication of the expected
image size of any faces that may be present in the foreground of the image. In this regard, it
is noted that the focus and zoom settings between them define the expected separation
between the camcorder 100 and a person being filmed, and also the magnification of the lens
110. From these two attributes, based upon an average face size, it is possible to calculate
the expected size (in pixels) of a face in the resulting image data, leading again to a
sub-range of sizes for search or a weighting of the expected face sizes.

Advantages of the tracking algorithm 

The face tracking technique has three main benefits:

• It allows missed faces to be filled in by using Kalman filtering and skin colour tracking
in frames for which no face detection results are available. This increases the true
acceptance rate across the image sequence.

• It provides face linking: by successfully tracking a face, the algorithm automatically
knows whether a face detected in a future frame belongs to the same person or a different
person. Thus, scene metadata can easily be generated from this algorithm, comprising the
number of faces in the scene, the frames for which they are present and providing a
representative mugshot of each face.

• False face detections tend to be rejected, as such detections tend not to carry forward
between images.

Figures 19a to 19c schematically illustrate the use of face tracking when applied to a
video scene.

In particular, Figure 19a schematically illustrates a video scene 800 comprising
successive video images (e.g. fields or frames) 810.

In this example, the images 810 contain one or more faces. In particular all of the
images 810 in the scene include a face A, shown at an upper left-hand position within the
schematic representation of the image 810. Also, some of the images include a face B shown
schematically at a lower right hand position within the schematic representations of the

A face tracking process is applied to the scene of Figure 19a. Face A is tracked
reasonably successfully throughout the scene. In one image 820 the face is not tracked by a
direct detection, but the skin colour matching techniques and the Kalman filtering techniques
described above mean that the detection can be continuous either side of the "missing" image
820. The representation of Figure 19b indicates the detected probability of a face being
present in each of the images. It can be seen that the probability is highest at an image 830,
and so the part 840 of the image detected to contain face A is used as a "picture stamp" in
respect of face A. Picture stamps will be described in more detail below.

Similarly, face B is detected with different levels of confidence, but an image 850
gives rise to the highest detected probability of face B being present. Accordingly, the part
of the corresponding image detected to contain face B (part 860) is used as a picture stamp
for face B within that scene. (Alternatively, of course, a wider section of the image, or even
the whole image, could be used as the picture stamp).
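
Selecting the picture stamp for a tracked face amounts to taking the entry with the highest detected probability. A minimal sketch follows, in which the representation of a track as a list of (frame number, probability, region) tuples and the function name are assumptions made for illustration.

```python
def select_picture_stamp(track):
    """Return the (frame_number, probability, region) entry of a tracked
    face that has the highest face probability."""
    return max(track, key=lambda entry: entry[1])


# Hypothetical track for one face; regions are (left, top, width, height).
track_a = [
    (10, 0.71, (40, 30, 64, 64)),
    (11, 0.93, (42, 31, 64, 64)),
    (12, 0.88, (44, 31, 64, 64)),
]
stamp = select_picture_stamp(track_a)
```

The region of the selected entry identifies the part of the image to be used as the picture stamp.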

Figure 20 schematically illustrates a display screen of a non-linear editing system. 

Non-linear editing systems are well established and are generally implemented as
software programs running on general purpose computing systems such as the system of
Figure 1. These editing systems allow video, audio and other material to be edited to an
output media product in a manner which does not depend on the order in which the
individual media items (e.g. video shots) were captured.

The schematic display screen of Figure 20 includes a viewer area 900, in which video
clips may be viewed, a set of clip icons 910, to be described further below, and a "timeline"
920 including representations of edited video shots 930, each shot optionally containing a
picture stamp 940 indicative of the content of that shot.

At one level, the face picture stamps derived as described with reference to Figures
19a to 19c could be used as the picture stamps 940 of each edited shot. So, within the edited
length of the shot, which may be shorter than the originally captured shot, the picture stamp
representing a face detection which resulted in the highest face probability value can be
inserted onto the time line to show a representative image from that shot. The probability
values may be compared with a threshold, possibly higher than the basic face detection
threshold, so that only face detections having a high level of confidence are used to generate
picture stamps in this way. If more than one face is detected in the edited shot, the face with
the highest probability may be displayed, or alternatively more than one face picture stamp
may be displayed on the time line.

Time lines in non-linear editing systems are usually capable of being scaled, so that
the length of line corresponding to the full width of the display screen can represent various
different time periods in the output media product. So, for example, if a particular boundary
between two adjacent shots is being edited to frame accuracy, the time line may be
"expanded" so that the width of the display screen represents a relatively short time period in
the output media product. On the other hand, for other purposes such as visualising an
overview of the output media product, the time line scale may be contracted so that a longer
time period may be viewed across the width of the display screen. So, depending on the
level of expansion or contraction of the time line scale, there may be less or more screen area
available to display each edited shot contributing to the output media product.

In an expanded time line scale, there may well be more than enough room to fit one
picture stamp (derived as shown in Figures 19a to 19c) for each edited shot making up the
output media product. However, as the time line scale is contracted, this may no longer be
possible. In such cases, the shots may be grouped together into "sequences", where each
sequence is such that it is displayed at a display screen size large enough to accommodate a
face picture stamp. From within the sequence, then, the face picture stamp having the
highest corresponding probability value is selected for display. If no face is detected within
a sequence, an arbitrary image, or no image, can be displayed on the timeline.

Figure 20 also shows schematically two "face timelines" 925, 935. These scale with
the "main" timeline 920. Each face timeline relates to a single tracked face, and shows the
portions of the output edited sequence containing that tracked face. It is possible that the
user may observe that certain faces relate to the same person but have not been associated
with one another by the tracking algorithm. The user can "link" these faces by selecting the
relevant parts of the face timelines (using a standard Windows™ selection technique for
multiple items) and then clicking on a "link" screen button (not shown). The face timelines
would then reflect the linkage of the whole group of face detections into one longer tracked
face.

Figures 21a and 21b schematically illustrate two variants of clip icons 910' and 910".
These are displayed on the display screen of Figure 20 to allow the user to select individual
clips for inclusion in the time line and editing of their start and end positions (in and out
points). So, each clip icon represents the whole of a respective clip stored on the system.

In Figure 21a, a clip icon 910" is represented by a single face picture stamp 912 and
a text label area 914 which may include, for example, time code information defining the
position and length of that clip. In an alternative arrangement shown in Figure 21b, more
than one face picture stamp 916 may be included by using a multi-part clip icon.

Another possibility for the clip icons 910 is that they provide a "face summary" so
that all detected faces are shown as a set of clip icons 910, in the order in which they appear
(either in the source material or in the edited output sequence). Again, faces that are the
same person but which have not been associated with one another by the tracking algorithm
can be linked by the user subjectively observing that they are the same face. The user could
select the relevant face clip icons 910 (using a standard Windows™ selection technique for
multiple items) and then click on a "link" screen button (not shown). The tracking data
would then reflect the linkage of the whole group of face detections into one longer tracked
face.



Figures 22a to 22c schematically illustrate a gradient pre-processing technique. 

It has been noted that image windows showing little pixel variation can tend to be
detected as faces by a face detection arrangement based on eigenfaces or eigenblocks.
Therefore, a pre-processing step is proposed to remove areas of little pixel variation from the
face detection process. In the case of a multiple scale system (see above) the pre-processing
step can be carried out at each scale.

The basic process is that a "gradient test" is applied to each possible window position
across the whole image. A predetermined pixel position for each window position, such as
the pixel at or nearest the centre of that window position, is flagged or labelled in
dependence on the results of the test applied to that window. If the test shows that a window
has little pixel variation, that window position is not used in the face detection process.

A first step is illustrated in Figure 22a. This shows a window at an arbitrary window
position in the image. As mentioned above, the pre-processing is repeated at each possible
window position. Referring to Figure 22a, although the gradient pre-processing could be
applied to the whole window, it has been found that better results are obtained if the
pre-processing is applied to a central area 1000 of the test window 1010.

Referring to Figure 22b, a gradient-based measure is derived from the window (or
from the central area of the window as shown in Figure 22a), which is the average of the
absolute differences between all adjacent pixels 1011 in both the horizontal and vertical
directions, taken over the window. Each window centre position is labelled with this
gradient-based measure to produce a gradient "map" of the image. The resulting gradient
map is then compared with a threshold gradient value. Any window positions for which the
gradient-based measure lies below the threshold gradient value are excluded from the face
detection process in respect of that image.
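
A minimal NumPy sketch of this gradient test follows. It is an illustration only: the window size of 8 pixels, the `border` of 2 pixels that defines the central area, and the function names are assumptions, and for simplicity the map is indexed by the top-left corner of each window rather than by its centre pixel.

```python
import numpy as np


def gradient_measure(area: np.ndarray) -> float:
    """Average absolute difference between horizontally and vertically
    adjacent pixels in a 2-D luminance array."""
    area = area.astype(np.float64)
    horizontal = np.abs(np.diff(area, axis=1))  # left/right neighbours
    vertical = np.abs(np.diff(area, axis=0))    # up/down neighbours
    count = horizontal.size + vertical.size
    if count == 0:
        return 0.0
    return float((horizontal.sum() + vertical.sum()) / count)


def gradient_map(image: np.ndarray, window: int = 8, border: int = 2) -> np.ndarray:
    """Label every window position with the measure taken over the
    central area (the window shrunk by `border` pixels on each side)."""
    rows = image.shape[0] - window + 1
    cols = image.shape[1] - window + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            central = image[r + border:r + window - border,
                            c + border:c + window - border]
            out[r, c] = gradient_measure(central)
    return out


def usable_positions(image: np.ndarray, threshold: float,
                     window: int = 8, border: int = 2) -> np.ndarray:
    """Boolean map: True where the window is kept for face detection,
    False where the measure lies below the threshold."""
    return gradient_map(image, window, border) >= threshold
```

In a multiple scale system the same functions would simply be called once per scaled image; the threshold value itself is not given in the description and would have to be chosen empirically.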

Alternative gradient-based measures could be used, such as the pixel variance or the 
mean absolute pixel difference from a mean pixel value. 
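
The two alternative measures can be written compactly. The sketch below assumes a 2-D NumPy array of luminance values for the window (or its central area); the function names are hypothetical.

```python
import numpy as np


def pixel_variance(area: np.ndarray) -> float:
    """Variance of the pixel values in the area."""
    return float(np.var(area.astype(np.float64)))


def mean_abs_deviation(area: np.ndarray) -> float:
    """Mean absolute difference of each pixel from the mean pixel value."""
    values = area.astype(np.float64)
    return float(np.mean(np.abs(values - values.mean())))
```

Both return zero for a perfectly flat area, so either could be compared against a threshold in the same way as the adjacent-pixel measure, although the threshold would need to be chosen separately for each measure.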

The gradient-based measure is preferably carried out in respect of pixel luminance 
values, but could of course be applied to other image components of a colour image. 

Figure 22c schematically illustrates a gradient map derived from an example image.
Here a lower gradient area 1070 (shown shaded) is excluded from face detection, and only a
higher gradient area 1080 is used.

The embodiments described above have related to a
face detection system (involving training and detection phases) and possible uses for it in a
camera-recorder and an editing system. It will be appreciated that there are many other
possible uses of such techniques, for example (and not limited to) security surveillance
systems, media handling in general (such as video tape recorder controllers), video
conferencing systems and the like.

It will be appreciated that the embodiments of the invention described above may of
course be implemented, at least in part, using software-controlled data processing apparatus.
For example, one or more of the components schematically illustrated or described above may
be implemented as a software-controlled general purpose data processing device or a bespoke
program controlled data processing device such as an application specific integrated circuit, a
field programmable gate array or the like. It will be appreciated that a computer program
providing such software or program control and a storage, transmission or other providing
medium by which such a computer program is stored are envisaged as aspects of the present
invention.

The list of references and appendices follow. For the avoidance of doubt, it is noted
that the list and the appendices form a part of the present description.

Appendix A: Training Face Sets

One database consists of many thousand images of subjects standing in front of an indoor 
background. Another training database used in experimental implementations of the above 
techniques consists of more than ten thousand eight-bit greyscale images of human heads 
with views ranging from frontal to left and right profiles. The skilled man will of course 
understand that various different training sets could be used, optionally being profiled to 
reflect facial characteristics of a local population.

Appendix B - Eigenblocks

In the eigenface approach to face detection and recognition (References 4 and 5),
each m-by-n face image is reordered so that it is represented by a vector of length mn. Each
image can then be thought of as a point in mn-dimensional space. A set of images maps to a
collection of points in this large space.

Face images, being similar in overall configuration, are not randomly distributed in
this mn-dimensional image space and therefore they can be described by a relatively low
dimensional subspace. Using principal component analysis (PCA), the vectors that best
account for the distribution of face images within the entire image space can be found. PCA
involves determining the principal eigenvectors of the covariance matrix corresponding to
the original face images. These vectors define the subspace of face images, often referred to
as the face space. Each vector represents an m-by-n image and is a linear combination of the
original face images. Because the vectors are the eigenvectors of the covariance matrix
corresponding to the original face images, and because they are face-like in appearance, they
are often referred to as eigenfaces [4].

When an unknown image is presented, it is projected into the face space. In this way,
it is expressed in terms of a weighted sum of eigenfaces.

In the present embodiments, a closely related approach is used, to generate and apply 
so-called "eigenblocks" or eigenvectors relating to blocks of the face image. A grid of 
blocks is applied to the face image (in the training set) or the test window (during the 
detection phase) and an eigenvector-based process, very similar to the eigenface process, is 
applied at each block position. (Or, in an alternative embodiment, to save on data processing,
the process is applied once to the group of block positions, producing one set of eigenblocks
for use at any block position.) The skilled man will understand that some blocks, such as a
central block often representing a nose feature of the image, may be more significant in 
deciding whether a face is present. 



Calculating Eigenblocks

The calculation of eigenblocks involves the following steps:

(1). A training set of $N_T$ images is used. These are divided into image blocks each of
size $m \times n$. So, for each block position a set of image blocks, one from that position in each
image, is obtained: $\{I_o^t\}_{t=1}^{N_T}$.

(2). A normalised training set of blocks $\{I^t\}_{t=1}^{N_T}$ is calculated as follows:

Each image block, $I_o^t$, from the original training set is normalised to have a mean of
zero and an L2-norm of 1, to produce a respective normalised image block, $I^t$.
For each image block, $I_o^t$, $t = 1, \ldots, N_T$:

$$I^t = \frac{I_o^t - \mathrm{mean}\_I_o^t}{\lVert I_o^t - \mathrm{mean}\_I_o^t \rVert}$$

where

$$\mathrm{mean}\_I_o^t = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I_o^t[i,j]$$

and

$$\lVert I_o^t - \mathrm{mean}\_I_o^t \rVert = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \left( I_o^t[i,j] - \mathrm{mean}\_I_o^t \right)^2}$$

(i.e. the L2-norm of $(I_o^t - \mathrm{mean}\_I_o^t)$).

(3). A training set of vectors $\{x^t\}_{t=1}^{N_T}$ is formed by lexicographic reordering of the pixel
elements of each image block, $I^t$, i.e. each m-by-n image block, $I^t$, is reordered into a
vector, $x^t$, of length $N = mn$.

(4). The set of deviation vectors, $D = \{x^t\}_{t=1}^{N_T}$, is calculated. $D$ has $N$ rows and $N_T$
columns.

(5). The covariance matrix, $\Sigma$, is calculated:

$$\Sigma = D D^{T}$$

$\Sigma$ is a symmetric matrix of size $N \times N$.

(7). The whole set of eigenvectors, $P$, and eigenvalues, $\lambda_i$, $i = 1, \ldots, N$, of the covariance
matrix, $\Sigma$, are given by solving:

$$\Lambda = P^{T} \Sigma P$$

Here, $\Lambda$ is an $N \times N$ diagonal matrix with the eigenvalues, $\lambda_i$, along its diagonal (in
order of magnitude) and $P$ is an $N \times N$ matrix containing the set of $N$ eigenvectors, each of
length $N$. This decomposition is also known as a Karhunen-Loeve Transform (KLT).

The eigenvectors can be thought of as a set of features that together characterise the
variation between the blocks of the face images. They form an orthogonal basis by which
any image block can be represented, i.e. in principle any image can be represented without
error by a weighted sum of the eigenvectors.
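
The calculation steps above can be sketched with NumPy for a single block position. This is an illustration under stated assumptions rather than the implementation used in the embodiments: the function names are hypothetical, `numpy.linalg.eigh` is simply one convenient solver for a symmetric matrix, and the input is assumed to be an array of shape `(N_T, m, n)` holding the original blocks.

```python
import numpy as np


def normalise_block(block: np.ndarray) -> np.ndarray:
    """Step (2): subtract the block mean and divide by the L2-norm."""
    centred = block.astype(np.float64) - block.mean()
    norm = np.linalg.norm(centred)  # L2-norm taken over all pixels
    if norm == 0.0:
        raise ValueError("a constant block cannot be normalised")
    return centred / norm


def eigenblocks(blocks: np.ndarray):
    """Return (eigenvalues, P) for one block position.

    Eigenvalues are in decreasing order and the matching eigenvectors
    are the columns of P, so that P.T @ Sigma @ P is diagonal.
    """
    # Steps (2)-(4): normalise, reorder row by row, stack as columns of D.
    D = np.stack([normalise_block(b).ravel() for b in blocks], axis=1)  # N x N_T
    sigma = D @ D.T                          # step (5): N x N, symmetric
    eigenvalues, P = np.linalg.eigh(sigma)   # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]    # reorder to decreasing magnitude
    return eigenvalues[order], P[:, order]
```

Because each normalised vector has unit length, the eigenvalues returned for a training set of `N_T` blocks sum to `N_T`, which gives a quick sanity check on the result.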

If the number of data points in the image space (the number of training images) is
less than the dimension of the space ($N_T < N$), then there will only be $N_T$ meaningful
eigenvectors. The remaining eigenvectors will have associated eigenvalues of zero. Hence,
because typically $N_T < N$, all eigenvalues for which $i > N_T$ will be zero.

Additionally, because the image blocks in the training set are similar in overall 
configuration (they are all derived from faces), only some of the remaining eigenvectors will 
characterise very strong differences between the image blocks. These are the eigenvectors 
with the largest associated eigenvalues. The other remaining eigenvectors with smaller 
associated eigenvalues do not characterise such large differences and therefore they are not 
as useful for detecting or distinguishing between faces. 

Therefore, in PCA, only the $M$ principal eigenvectors with the largest magnitude
eigenvalues are considered, where $M < N_T$, i.e. a partial KLT is performed. In short, PCA
extracts a lower-dimensional subspace of the KLT basis corresponding to the largest
magnitude eigenvalues.
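
A short numerical illustration of the partial KLT follows. The sizes (`N = 16`, `N_T = 5`, `M = 3`), the random stand-in data and the function name are assumptions chosen only to make the point: with fewer training vectors than dimensions, at most `N_T` eigenvalues are non-zero, and PCA keeps only the `M` largest.

```python
import numpy as np


def principal_eigenvectors(sigma: np.ndarray, M: int):
    """Return the M largest eigenvalues of a symmetric matrix and the
    matching eigenvectors (as columns)."""
    eigenvalues, P = np.linalg.eigh(sigma)     # ascending order
    order = np.argsort(eigenvalues)[::-1][:M]  # indices of the M largest
    return eigenvalues[order], P[:, order]


rng = np.random.default_rng(0)
N, N_T = 16, 5                       # illustrative sizes with N_T < N
D = rng.standard_normal((N, N_T))    # stand-in for the deviation vectors
sigma = D @ D.T

# Count eigenvalues that are non-zero up to rounding error.
all_values = np.linalg.eigvalsh(sigma)[::-1]
nonzero = int(np.sum(all_values > 1e-9 * all_values[0]))

# Partial KLT: keep only the M = 3 principal eigenvectors.
values, basis = principal_eigenvectors(sigma, M=3)
```

Here `nonzero` counts the meaningful eigenvectors and `basis` is the N x M matrix spanning the retained subspace.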

Because the principal components describe the strongest variations between the face 
images, in appearance they may resemble parts of face blocks and are referred to here as 
eigenblocks. However, the term eigenvectors could equally be used. 

Face Detection using Eigenblocks 

The similarity of an unknown image to a face, or its faceness, can be measured by 
determining how well the image is represented by the face space. This process is carried out
on a block-by-block basis, using the same grid of blocks as that used in the training process. 

The first stage of this process involves projecting the image into the face space. 

Projection of an Image into Face Space 

Before projecting an image into face space, much the same pre-processing steps are 
performed on the image as were performed on the training set: 
(1). A test image block of size $m \times n$ is obtained: $I_o$.

(2). The original test image block, $I_o$, is normalised to have a mean of zero and an
L2-norm of 1, to produce the normalised test image block, $I$:

$$I = \frac{I_o - \mathrm{mean}\_I_o}{\lVert I_o - \mathrm{mean}\_I_o \rVert}$$

where

$$\mathrm{mean}\_I_o = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I_o[i,j]$$

and

$$\lVert I_o - \mathrm{mean}\_I_o \rVert = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \left( I_o[i,j] - \mathrm{mean}\_I_o \right)^2}$$

(i.e. the L2-norm of $(I_o - \mathrm{mean}\_I_o)$).

(3). The deviation vectors are calculated by lexicographic reordering of the pixel
elements of the image. The image is reordered into a deviation vector, $x$, of length $N = mn$.

After these pre-processing steps, the deviation vector, $x$, is projected into face space
using the following simple step:

(4). The projection into face space involves transforming the deviation vector, $x$, into its
eigenblock components. This involves a simple multiplication by the $M$ principal
eigenvectors (the eigenblocks), $P_i$, $i = 1, \ldots, M$. Each weight $y_i$ is obtained as follows:

$$y_i = P_i^{T} x$$

where $P_i$ is the $i$th eigenvector.

The weights $y_i$, $i = 1, \ldots, M$, describe the contribution of each eigenblock in
representing the input face block.
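
The projection steps above amount to a normalisation followed by one matrix product. The NumPy sketch below assumes the M principal eigenvectors are supplied as the columns of an N x M array; the function and parameter names are hypothetical.

```python
import numpy as np


def project_block(block: np.ndarray, eigenblocks: np.ndarray) -> np.ndarray:
    """Project an m x n test block into face space.

    eigenblocks: N x M matrix whose columns are the M principal
    eigenvectors (N = m * n). Returns the M weights.
    """
    centred = block.astype(np.float64) - block.mean()
    norm = np.linalg.norm(centred)
    if norm == 0.0:
        raise ValueError("a constant block cannot be normalised")
    x = (centred / norm).ravel()   # steps (2) and (3): normalise, reorder
    return eigenblocks.T @ x       # step (4): one weight per eigenblock
```

Because of the normalisation, the weights are unchanged if the test block is uniformly brightened or its contrast is scaled by a positive factor.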

Blocks of similar appearance will have similar sets of weights while blocks of
different appearance will have different sets of weights. Therefore, the weights are used here
as feature vectors for classifying face blocks during face detection.

CLAIMS

1. Video face detection apparatus in which a test image from a video sequence is
compared with an image property model derived from image properties of a region detected
to contain a face in a preceding image in the video sequence; the apparatus comprising:

means for selecting a predetermined proportion of pixels in the region detected to
contain a face in the preceding image which most closely match the image property model
derived in respect of that region, thereby deriving a pixel mask;

means for comparing pixels in the test image defined by the pixel mask with the
image property model, the mask being applied at more than one image position within the
test image; a face being detected in the test image at a mask position corresponding to a
lowest average difference between the image property model and pixels defined by the mask
at that position.

2. Apparatus according to claim 1, in which the image property model is a colour
model. 

3. Apparatus according to claim 1 or claim 2, in which the colour model is a Gaussian 
model of colour distribution. 

4. Apparatus according to claim 1 or claim 2, in which the colour model represents a
colour distribution in at least a part of at least one image of the video sequence. 

5. Apparatus according to any one of the preceding claims, in which the mask is applied
to the test image at positions within a test region surrounding the image position of the
detected face in the preceding image.

6. Apparatus according to claim 5, in which the test region is a rectangular region. 

7. Apparatus according to any one of the preceding claims, in which the predetermined
proportion is 50% of the pixels. 

8. Video conferencing apparatus comprising apparatus according to any one of the
preceding claims.

9. Surveillance apparatus comprising apparatus according to any one of claims 1 to 7.

10. Face detection apparatus substantially as hereinbefore described with reference to the
accompanying drawings.

11. A video face detection method in which a test image from a video sequence is
compared with an image property model derived from image properties of a region detected
to contain a face in a preceding image in the video sequence; the method comprising the
steps of:

selecting a predetermined proportion of pixels in the region detected to contain a face
in the preceding image which most closely match the image property model derived in
respect of that region, thereby deriving a pixel mask; and

comparing pixels in the test image defined by the pixel mask with the image property
model, the mask being applied at more than one image position within the test image; a face
being detected in the test image at a mask position corresponding to a lowest average
difference between the image property model and pixels defined by the mask at that position.

12. A video face detection method substantially as hereinbefore described with reference
to the accompanying drawings.

13. Computer software having program code for carrying out a method according to 
claim 11 or claim 12. 

14. A providing medium for providing program code according to claim 13.

15. A medium according to claim 14, the medium being a storage medium. 

16. A medium according to claim 14, the medium being a transmission medium. 

ABSTRACT
FACE DETECTION

Video face detection apparatus in which a test image from a video sequence is
compared with an image property model derived from image properties of a region detected
to contain a face in a preceding image in the video sequence comprises:

means for selecting a predetermined proportion of pixels in the region detected to
contain a face in the preceding image which most closely match the image property model
derived in respect of that region, thereby deriving a pixel mask; and

means for comparing pixels in the test image defined by the pixel mask with the
image property model, the mask being applied at more than one image position within the
test image; a face being detected in the test image at a mask position corresponding to a
lowest average difference between the image property model and pixels defined by the mask
at that position.

Figure 16.